We start by importing the libraries we will use later in the project.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import json
from urllib.request import urlopen
import seaborn as sns
sns.set(style="whitegrid", color_codes=True, font_scale=1.5)
import plotly.express as px
import plotly.graph_objects as go
from sklearn import linear_model as lm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
We then read the datasets that we will use throughout this project.
We also brought in two additional datasets from the USDA's county-level data sets (link: https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/).
#states = pd.read_csv('finalproj/data/5.12states.csv')
counties = pd.read_csv('abridged_couties.csv')
confirmed = pd.read_csv('new_time_series_covid19_confirmed_US.csv')
deaths = pd.read_csv('new_time_series_covid19_deaths_US.csv')
#Following data sets taken from the USDA's county level data sets
poverty = pd.read_csv('PovertyEstimates.csv')
unemployment = pd.read_csv('Unemployment.csv', thousands=',')
counties.head()
confirmed.head()
deaths.head()
poverty.head()
unemployment.head()
The first step is creating a merged dataset with each county's information, the number of confirmed cases, the number of deaths, and other relevant information from other datasets.
We create a copy to ensure that the original counties dataframe is intact.
Since the "countyFIPS" column is currently of type str, we convert it to numeric so we can match it against the "FIPS" columns in confirmed and deaths. We also drop any rows with an empty FIPS, since every valid county has a FIPS value.
The FIPS column serves as a primary key: it is a standardized identifier for counties that appears in most county-level datasets, including ones found online.
We first inner merge merged with confirmed on the FIPS columns, keeping only the "5/12/20" column since it contains the most up-to-date number of cases per county, and rename it so it is easier to reference later.
We then inner merge the result with deaths in the same way, again keeping only the "5/12/20" column (the most up-to-date number of deaths per county) and renaming it for easier reference.
merged = counties.copy()
merged["countyFIPS"] = pd.to_numeric(merged["countyFIPS"], errors='coerce')
merged = merged.dropna(subset=["countyFIPS"])
#merge w confirmed
merged = merged.merge(right=confirmed[["FIPS", "5/12/20"]], how='inner', left_on='countyFIPS', right_on='FIPS')
merged = merged.rename(columns={'5/12/20' : 'confirmed'})
#merged w deaths
merged = merged.merge(right=deaths[["FIPS", "5/12/20"]], how='inner', left_on='countyFIPS', right_on='FIPS')
merged = merged.rename(columns={'5/12/20' : 'deaths'})
#drop irrelevant columns
merged = merged.drop(["FIPS_x", "FIPS_y"], axis = 1)
merged
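Because both frames contribute a FIPS-like key, pandas suffixes any duplicated column names with _x and _y, which is why we drop FIPS_x and FIPS_y above. A minimal sketch of this behavior, using toy DataFrames invented for illustration:

```python
import pandas as pd

left = pd.DataFrame({"countyFIPS": [1001, 1003], "name": ["Autauga", "Baldwin"]})
right = pd.DataFrame({"FIPS": [1001, 1003], "cases": [42, 7]})

# merging on differently named keys keeps both key columns
out = left.merge(right, how="inner", left_on="countyFIPS", right_on="FIPS")
print(out.columns.tolist())  # ['countyFIPS', 'name', 'FIPS', 'cases']

# merging the result with a second frame that also has a FIPS column
# duplicates the name, so pandas renames the pair to FIPS_x and FIPS_y
right2 = pd.DataFrame({"FIPS": [1001, 1003], "deaths": [1, 0]})
out2 = out.merge(right2, how="inner", left_on="countyFIPS", right_on="FIPS")
print(out2.columns.tolist())  # ['countyFIPS', 'name', 'FIPS_x', 'cases', 'FIPS_y', 'deaths']
```

Dropping the suffixed columns afterwards leaves countyFIPS as the single surviving key.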
Here, we merge the two USDA datasets in a similar fashion to what we did above. We convert the FIPS columns in poverty and unemployment to numeric data again and select the columns from each that we want to use in our analysis before performing an inner merge.
merged2 = merged.copy()
poverty["FIPStxt"] = pd.to_numeric(poverty["FIPStxt"], errors='coerce')
unemployment["FIPStxt"] = pd.to_numeric(unemployment["FIPStxt"], errors='coerce')
#merge w poverty to add and rename pov_pct
merged2 = merged2.merge(right=poverty[["FIPStxt", "PCTPOVALL_2018"]], how='inner', left_on='countyFIPS', right_on='FIPStxt')
merged2 = merged2.rename(columns={"PCTPOVALL_2018" : "pov_pct"})
#merge w unemployment to add and rename unemploy_rate and median_household_income
merged2 = merged2.merge(right=unemployment[["FIPStxt", "Unemployment_rate_2018", "Median_Household_Income_2018"]], how='inner', left_on='countyFIPS', right_on='FIPStxt')
merged2 = merged2.rename(columns={"Unemployment_rate_2018" : "unemploy_rate", "Median_Household_Income_2018": "med_income"})
#drop the duplicated FIPStxt key columns, as we did with FIPS above
merged2 = merged2.drop(["FIPStxt_x", "FIPStxt_y"], axis = 1)
We create a new copy of all the merged datasets. We first add a case_rate column: the number of cases divided by the total population of a county, multiplied by 100 to make it easier to read.
We then add a death_rate column: the number of deaths divided by the number of confirmed cases, again multiplied by 100 for readability. Before we do this, we remove the 272 counties that have 0 confirmed cases so we don't get a division-by-zero error. Since this project aims to study the effects of certain factors on case_rate and death_rate, counties with no cases don't add anything to our data.
Other columns we added:
- old is the percent of the population that is age 65+
- inMedicare is the percent of Medicare-eligible people who are actually enrolled
- medicare_rate is the percent of the population that is eligible for Medicare

We also noticed that counties whose latitude and longitude were 0 or missing were territories, invalid counties, or entries lacking important information, so we remove all such rows.
We also analyzed which columns were missing more data than others; the worst were "3-Yr Diabetes", all of the "3-YrMortality" columns, "stay at home" (396 missing values), and "HPSA" (1013 missing values).
We also checked on what states were included in the data after cleaning. The data now covered 48 states (excluding Alaska and Hawaii since they're not continental and including District of Columbia).
#add case rate
data = merged2.copy()
data["case_rate"] = data["confirmed"] / data["PopulationEstimate2018"]
data["case_rate"] = data["case_rate"] * 100
#add death rate
data = data.loc[data["confirmed"] != 0]
data["death_rate"] = data["deaths"] / data["confirmed"]
data["death_rate"] = data["death_rate"] * 100
#add age column
data["old"] = (data['PopulationEstimate65+2017'] / data["PopulationEstimate2018"]) * 100
#add medicare cols
data = data.dropna(subset=["MedicareEnrollment,AgedTot2017"])
data["inMedicare"] = (data["MedicareEnrollment,AgedTot2017"] / data["#EligibleforMedicare2018"]) * 100
data["medicare_rate"] = (data["#EligibleforMedicare2018"] / data["PopulationEstimate2018"]) * 100
#add hospitals and icu_beds per 1000 people
data["hospitals/1000ppl"] = (data["#Hospitals"] / data["PopulationEstimate2018"]) * 1000
data["icu_beds/1000ppl"] = (data["#ICU_beds"] / data["PopulationEstimate2018"]) * 1000
#drop null or 0 latitudes
data = data.loc[data["lat"] != 0]
data = data.dropna(subset=["lat"])
data.isna().sum()
#note: columns with many na values include ["3-Yr Diabetes", all the "3-YrMortality", "stay at home" (396), "HPSA" (1013)]
data.State.unique()
#48 states (excluding Alaska and Hawaii since they're not continental and including District of Columbia)
data
Since a big portion of the news today is focused on stay at home orders, we wanted to see if there was any obvious effect of when the stay at home orders were put in place and the rates of cases and deaths in a county.
After calculating a more readable number of days for the "stay at home" column, we made boxplots for each of case_rate and death_rate. We decided to hide outliers to make the plot more readable, and there were many outliers that fell way beyond the current graph axes.
dates = data.copy()
#737425 is the proleptic Gregorian ordinal of January 1, 2020 (datetime.date(2020, 1, 1).toordinal())
#we subtract it so the column counts how many days into 2020 the order was put in place
dates["stayHome"] = dates["stay at home"] - 737425
#drop all counties where there is no stay at home order established since this is about timing
dates = dates.dropna(subset=["stayHome"])
print("Stay at home dates range (in days since Jan. 1, 2020): ", "March 17, 2020 (", min(dates["stayHome"]), " days) - April 5, 2020 (", max(dates["stayHome"]), " days)")
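To keep the day-count conversion honest, the ordinal for the start of 2020 can be derived directly with the standard library rather than hard-coded from memory. A minimal sketch:

```python
from datetime import date

# proleptic Gregorian ordinal of January 1, 2020
jan1 = date(2020, 1, 1).toordinal()
print(jan1)  # 737425

# converting back recovers the date, so (ordinal - jan1)
# is the number of days into 2020
print(date.fromordinal(jan1 + 76))  # 2020-03-17
```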
#boxplot of case_rate
plt.figure(figsize = (15, 8))
sns.boxplot(x="stayHome", y = "case_rate", data=dates, showfliers=False)
plt.xlabel("# Days since Jan. 1, 2020 that Stay at Home orders were put into place")
#boxplot of death_rate
plt.figure(figsize = (15, 8))
sns.boxplot(x="stayHome", y = "death_rate", data=dates, showfliers=False)
plt.xlabel("# Days since Jan. 1, 2020 that Stay at Home orders were put into place")
We found the dem_to_rep_ratio column interesting, and since many people on either side have differing opinions about the pandemic and social distancing, we thought it would be interesting to see if there was any correlation to case_rate or death_rate.
We categorized the ratios into democratic and republican and then created boxplots for each factor.
political = data.copy()
political = political.dropna(subset=["dem_to_rep_ratio"])
#categorize; counties with a ratio of exactly 1 are left unlabeled (NaN) and excluded from the plots
political.loc[political["dem_to_rep_ratio"] > 1, "party"] = "Dem"
political.loc[political["dem_to_rep_ratio"] < 1, "party"] = "Rep"
political[["party"]]
#make boxplot for case_rate
plt.figure(figsize = (5, 5))
sns.boxplot(x="party", y = "case_rate", data=political, showfliers=False)
#make boxplot for death_rate
plt.figure(figsize = (5, 5))
sns.boxplot(x="party", y = "death_rate", data=political, showfliers=False)
Here, we just wanted to know which counties had the most confirmed cases as of 5/12. We weren't surprised to see New York, of course, but many of the other names were unexpected.
We also calculated the mean and median number of confirmed cases per county. Since we already removed the zero-case counties, these statistics better represent counties with actual outbreaks. The mean is clearly pulled upward by large outliers like the top 10 shown here. The median surprised us by being so low, since the news mostly covers counties with large case counts; on reflection, though, it makes sense that most counties, especially rural ones, have far fewer cases.
print("Mean number of confirmed cases per county: ", np.mean(data["confirmed"]))
print("Median number of confirmed cases per county: ", np.median(data["confirmed"]))
print("Total number of counties: ", data.shape[0])
#looking at the counties with the most cases and deaths
most_cases = data.sort_values("confirmed", ascending = False)
most_cases[["CountyName", "StateName", "confirmed"]].head(10)
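The mean-versus-median gap described above is a general property of right-skewed data: a few huge values drag the mean far above the median. A toy illustration (the counts here are invented):

```python
import numpy as np

# one populous outlier dominates the mean but barely moves the median,
# much like a handful of large counties do for confirmed-case counts
counts = np.array([3, 5, 8, 12, 20, 10000])
print(np.mean(counts))    # 1674.666...
print(np.median(counts))  # 10.0
```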
As we considered which factor to predict, we created some histograms to observe the distribution of case_rate.
hist_cases = data.copy()
hist_cases = hist_cases[["case_rate"]]
hist_cases.hist()
plt.title("Distribution of case_rate in all counties")
plt.xlabel("case_rate")
plt.ylabel("number of counties")
hist_cases.hist(range = (0,1))
plt.title("Distribution of case_rate within the 0-1 range")
plt.xlabel("case_rate")
plt.ylabel("number of counties")
hist_cases.hist(range = (0, 0.2))
plt.title("Distribution of case_rate within the 0-0.2 range")
plt.xlabel("case_rate")
plt.ylabel("number of counties")
hist_cases.loc[hist_cases["case_rate"] > 1].shape[0]
As we considered which factor to predict, we created some histograms to observe the distribution of death_rate.
hist_deaths = data.copy()
hist_deaths = hist_deaths[["death_rate"]]
#removed 0% death rates so we can look closely at the distribution of actual death rates
hist_deaths = hist_deaths.loc[hist_deaths["death_rate"] != 0]
hist_deaths.hist()
plt.title("Distribution of death_rate in all counties")
plt.xlabel("death_rate")
plt.ylabel("number of counties")
hist_deaths.hist(range = (0,20))
plt.title("Distribution of death_rate within the 0-20% range")
plt.xlabel("death_rate")
plt.ylabel("number of counties")
hist_deaths.hist(range = (0, 10))
plt.title("Distribution of death_rate within the 0-10% range")
plt.xlabel("death_rate")
plt.ylabel("number of counties")
hist_deaths.loc[hist_deaths["death_rate"] > 20].shape[0]
We noticed that there were many counties with no cases or no deaths, and we wanted to visually see where these counties were to see if there was a particular state or region that especially lacked cases or deaths.
We created a scatterplot based on latitude and longitude and plotted all counties with cases first and then all counties with deaths.
Points where only the blue dot is visible indicate areas with cases but no deaths. We also tuned the alpha value to lower the opacity, so you can see where cases or deaths are especially concentrated.
maps = data.copy()
#plot all counties with cases
plt.scatter(maps["lon"], maps["lat"], alpha = 0.04)
#remove counties with no deaths and plot the ones with any deaths
maps = maps.loc[maps["deaths"] != 0]
plt.scatter(maps["lon"], maps["lat"], alpha = 0.1)
#add and re-position the legend
plt.legend(["counties with cases", "counties with deaths"], loc="center left", bbox_to_anchor=(1, 0.5))
income = data.copy()
#line plot (hover to show information)
fig = px.line(income, x = "med_income", y = "death_rate", hover_name="CountyName")
fig.show()
We wanted to get a closer look at that trend, so we investigated further. We grouped median income by the thousands so we would have fewer data points, then created a scatterplot with a best-fit line to show the correlation more clearly.
Coupled with the case rate graph, the death rate vs median income graph shows that counties with higher incomes have lower death rates, while the case rate stays roughly constant across all income levels.
income = income.sort_values(["med_income"])
#add column to group income by the thousands
income["income (in k)"] = income["med_income"] // 1000
income = income[["med_income", "income (in k)", "death_rate", "case_rate"]]
#remove death_rate = 0 since that isn't representative
income = income.loc[income["death_rate"] != 0]
#group by income groups (of thousands)
incomegroup = income.groupby(["income (in k)"]).mean().sort_values(["med_income"])
fig, ax = plt.subplots(1,2, figsize=(15,5))
#plot case_rate and death_rate
sns.regplot(x="med_income", y="death_rate", data=incomegroup, ax=ax[0])
sns.regplot(x="med_income", y="case_rate", data=incomegroup, ax=ax[1])
We then wanted to see if there was a correlation between the number of ICU beds for every 1000 people and death rate, so we created a scatter plot.
The correlation wasn't that strong, so to use this, we'll have to add on more features.
icu = data.copy()
icu = icu.loc[icu["death_rate"] != 100]
icu = icu.loc[icu["death_rate"] != 0]
icu = icu.loc[icu["icu_beds/1000ppl"] != 0]
plt.scatter("icu_beds/1000ppl", "death_rate", data=icu)
plt.xlabel("icu_beds/1000ppl")
plt.ylabel("death_rate")
plt.title("Death Rate vs ICU Beds per 1000 people")
We finally wanted to see if there was a correlation between the age factors and death rate, so we created another scatter plot.
The first plot shows a positive correlation: counties with higher fractions of people 65+ have higher death rates. The second shows a weaker positive correlation: counties with higher median ages also tend to have higher death rates.
age = data.copy()
age = age.loc[age["death_rate"] != 100]
age = age.loc[age["death_rate"] != 0]
fig = px.scatter(age, x = "old", y = "death_rate", trendline = "ols")
fig.update_layout(
    title="Age vs Death Rate",
    xaxis_title="% of population with age 65+",
    yaxis_title="death rate",
)
fig2 = px.scatter(age, x = "MedianAge2010", y = "death_rate", trendline = "ols")
fig2.update_layout(
    title="Median Age vs Death Rate",
    xaxis_title="Median Age",
    yaxis_title="death rate",
)
Before building our models, we decided to further clean the data so the models could be as accurate as possible. We first removed outliers. Then we cut the dataset down to the factors we might use later in our models so it is easier to read and process. We counted the NaN values in each of the chosen columns, and seeing that they were negligible (1 or 2 counties each), we dropped all counties with any NaN values in those columns.
print(data.shape[0])
data.columns
traintest = data.copy()
#remove outliers
print(traintest.loc[traintest["death_rate"] == 100].shape[0]) #there are 2 counties where this is true
traintest = traintest.loc[traintest["death_rate"] != 100]
print(traintest.loc[traintest["confirmed"] > 10000].shape[0]) #there are 20 outliers with significantly more cases
traintest = traintest.loc[traintest["confirmed"] < 10000]
#also drop counties with no deaths
traintest = traintest.loc[traintest["deaths"] > 0]
traintest = traintest.rename(columns={"PopulationDensityperSqMile2010":"density"})
traintest = traintest[[
    "confirmed",
    "deaths",
    "death_rate",
    "case_rate",
    "old",
    "MedianAge2010",
    "density",
    "inMedicare",
    "medicare_rate",
    "pov_pct",
    "unemploy_rate",
    "med_income",
    "lat",
    "lon",
    "FracMale2017",
    "DiabetesPercentage",
    "HeartDiseaseMortality",
    "StrokeMortality",
    "Smokers_Percentage",
    "RespMortalityRate2014",
    "hospitals/1000ppl",
    "icu_beds/1000ppl",
    "SVIPercentile",
    "TotalM.D.'s,TotNon-FedandFed2017",
    "Rural-UrbanContinuumCode2013",
    "#FTEHospitalTotal2017"
]]
#print(traintest.isna().sum())
traintest = traintest.dropna()
traintest = traintest.loc[traintest["confirmed"] > 5]
traintest.shape[0]
Next we split our cleaned dataset into a training data set and a testing data set using sklearn. We put 85% of our data into the training set and 15% into the testing set. We picked a random_state so that the split was pseudorandom.
train, test = train_test_split(traintest, test_size=0.15, random_state=42)
train_c = train
test_c = test
# from sklearn import linear_model as lm
train_c
test_c
Here, we defined a function to calculate the root mean squared error (RMSE) given actual and predicted values. This was created while referencing the function defined in lecture.
We also initialized a dictionary to hold the models so we can compare them later, and arrays to hold the RMSEs for training data, cross validation, and testing data so we could compare them at the end. This was also created in reference to the Cross Validation lecture.
Since we are predicting death_rate, that variable is set accordingly (and this structure makes it easy to switch to case_rate or any other factor we want to predict!).
We also find the range of the death_rates in our data so we can contextualize our RMSE later.
def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted)**2))

def standardize(data):
    return (data - np.mean(data)) / np.std(data)
models = {}
training_rmse = []
validate_rmse = []
test_rmse = []
predicting = "death_rate"
print(min(train_c[predicting]), " - ", max(train_c[predicting]))
print(np.mean(train_c[predicting]))
standard = standardize(train_c[predicting])
print(min(standard), " - ", max(standard))
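As a quick sanity check, the rmse helper can be verified against a hand-computed case (a self-contained sketch with toy numbers):

```python
import numpy as np

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted)**2))

# errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5)
err = rmse(np.array([3.0, 4.0]), np.array([0.0, 0.0]))
print(err)  # 3.5355...
```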
We started the process of evaluating what features are best by creating different pairplots to compare death_rate and case_rate to certain factors. Below is the pairplot we created for some economic factors.
(Note: we created several different pairplots, but including all of them caused our kernel to crash, so we're only showing one below.)
#sns.pairplot(train_c[["death_rate", "case_rate", "pov_pct", "unemploy_rate", "med_income"]])
In terms of general health, the factors we took into account were:
def health_cols(data):
    return data[[predicting,
                 "old",
                 "MedianAge2010",
                 "DiabetesPercentage",
                 "HeartDiseaseMortality",
                 "StrokeMortality",
                 "Smokers_Percentage",
                 "RespMortalityRate2014"]]
We create the X_train and y_train variables for the model, and then create the Linear Regression model that we're using.
X_train_h = health_cols(train_c).drop([predicting], axis = 1)
y_train = train_c[predicting]
regr = lm.LinearRegression()
X_train_h
In the following cells, we fit the model and use it to predict values based on X_train. We use the previously defined rmse function to calculate the training error, and we also report the model's built-in score (which, for linear regression, is R²), although we mostly relied on RMSE to evaluate performance.
regr.fit(X_train_h, y_train)
y_fitted = regr.predict(X_train_h)
models["health"] = regr
training_error = rmse(y_fitted, y_train)
training_rmse.append(training_error)
training_error
print(regr.score(X_train_h, y_train))
Next, we perform cross validation on this model using the cross_val_score method from sklearn. Instead of the default scoring, we use the built-in "neg_root_mean_squared_error" scorer; sklearn negates error metrics so that higher scores are always better, so we flip the sign to recover positive RMSEs. We then take the mean of these 5 cross-validated RMSEs and use that later for comparison.
c_scores = cross_val_score(regr, X_train_h, y_train, cv=5, scoring="neg_root_mean_squared_error")
c_scores = c_scores * -1
c_scores.mean()
validate_rmse.append(c_scores.mean())
c_scores
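The sign flip above reflects a general sklearn convention: scorers are oriented so higher is better, so "neg_root_mean_squared_error" always yields non-positive values. A small demonstration on synthetic, perfectly linear data (invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# perfectly linear toy data, so the RMSE in every fold should be ~0
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(scores)  # all five scores are <= 0 (here essentially 0, since the fit is exact)
```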
Finally, we predict values for our testing data set and calculate the testing RMSE. We also find the residuals and plot them against the actual values (referenced from past homeworks and labs).
X_test_h = health_cols(test_c).drop([predicting], axis = 1)
y_test = test_c[predicting]
y_predicted = regr.predict(X_test_h)
residuals = y_test - y_predicted
ax = sns.regplot(x=y_test, y=residuals)
test_error = rmse(y_predicted, y_test)
test_rmse.append(test_error)
test_error
In terms of the healthcare system, the factors we took into account were:
def healthcare_cols(data):
    return data[[predicting,
                 "hospitals/1000ppl",
                 "icu_beds/1000ppl",
                 "inMedicare",
                 "TotalM.D.'s,TotNon-FedandFed2017",
                 "#FTEHospitalTotal2017",
                 "Rural-UrbanContinuumCode2013"]]
The model training, error finding, cross validation, and testing process mirrors the one above for the general health features.
X_train_hc = healthcare_cols(train_c).drop([predicting], axis = 1)
y_train = train_c[predicting]
regr2 = lm.LinearRegression()
regr2.fit(X_train_hc, y_train)
y_fitted = regr2.predict(X_train_hc)
models["healthcare"] = regr2
training_error = rmse(y_fitted, y_train)
training_rmse.append(training_error)
training_error
print(regr2.score(X_train_hc, y_train))
c_scores = cross_val_score(regr2, X_train_hc, y_train, cv=5, scoring="neg_root_mean_squared_error")
c_scores = c_scores * -1
c_scores.mean()
validate_rmse.append(c_scores.mean())
c_scores
X_test_hc = healthcare_cols(test_c).drop([predicting], axis = 1)
y_test = test_c[predicting]
y_predicted = regr2.predict(X_test_hc)
residuals = y_test - y_predicted
ax = sns.regplot(x=y_test, y=residuals)
test_error = rmse(y_predicted, y_test)
test_rmse.append(test_error)
test_error
In terms of the economic state, the factors we took into account were:
def econ_cols(data):
    return data[[predicting,
                 "SVIPercentile",
                 "medicare_rate",
                 "pov_pct",
                 "unemploy_rate",
                 "med_income"]]
The model training, error finding, cross validation, and testing process mirrors the one above for the general health features.
X_train_e = econ_cols(train_c).drop([predicting], axis = 1)
y_train = train_c[predicting]
regr3 = lm.LinearRegression()
regr3.fit(X_train_e, y_train)
y_fitted = regr3.predict(X_train_e)
models["economic"] = regr3
training_error = rmse(y_fitted, y_train)
training_rmse.append(training_error)
training_error
print(regr3.score(X_train_e, y_train))
c_scores = cross_val_score(regr3, X_train_e, y_train, cv=5, scoring="neg_root_mean_squared_error")
c_scores = c_scores * -1
c_scores.mean()
validate_rmse.append(c_scores.mean())
c_scores
X_test_e = econ_cols(test_c).drop([predicting], axis = 1)
y_test = test_c[predicting]
y_predicted = regr3.predict(X_test_e)
residuals = y_test - y_predicted
ax = sns.regplot(x=y_test, y=residuals)
test_error = rmse(y_predicted, y_test)
test_rmse.append(test_error)
test_error
Here we combined all the features from the previous 3 models.
def all_cols(data):
    return data[[predicting,
                 "old",
                 "DiabetesPercentage",
                 "HeartDiseaseMortality",
                 "StrokeMortality",
                 "Smokers_Percentage",
                 "RespMortalityRate2014",
                 "hospitals/1000ppl",
                 "icu_beds/1000ppl",
                 "inMedicare",
                 "TotalM.D.'s,TotNon-FedandFed2017",
                 "#FTEHospitalTotal2017",
                 "Rural-UrbanContinuumCode2013",
                 "SVIPercentile",
                 "medicare_rate",
                 "pov_pct",
                 "unemploy_rate",
                 "med_income"]]
The model training, error finding, cross validation, and testing process mirrors the one above for the general health features.
X_train_a = all_cols(train_c).drop([predicting], axis = 1)
y_train = train_c[predicting]
regr4 = lm.LinearRegression()
regr4.fit(X_train_a, y_train)
y_fitted = regr4.predict(X_train_a)
models["all"] = regr4
training_error = rmse(y_fitted, y_train)
training_rmse.append(training_error)
training_error
print(regr4.score(X_train_a, y_train))
c_scores = cross_val_score(regr4, X_train_a, y_train, cv=5, scoring="neg_root_mean_squared_error")
c_scores = c_scores * -1
c_scores.mean()
validate_rmse.append(c_scores.mean())
c_scores
X_test_a = all_cols(test_c).drop([predicting], axis = 1)
y_test = test_c[predicting]
y_predicted = regr4.predict(X_test_a)
residuals = y_test - y_predicted
ax = sns.regplot(x=y_test, y=residuals)
test_error = rmse(y_predicted, y_test)
test_rmse.append(test_error)
test_error
Here we compare the previous 4 models based on the training rmse, the cross validation rmse, and the testing rmse. This process was created by referencing lecture code.
#adapted from lecture code
def compare_models(models):
    #note: the three RMSE lists are globals that each model section above appends to
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x=names, y=training_rmse, name="Training RMSE"),
        go.Bar(x=names, y=validate_rmse, name="CV RMSE"),
        go.Bar(x=names, y=test_rmse, name="Test RMSE", opacity=.3)])
    return fig
training_rmse
fig = compare_models(models)
fig.update_yaxes(range=[4.05,4.52], title="RMSE")
As a note, we did not think of making an appendix until we saw the suggestion on Piazza, so not all of our ideas and original failed tests, models, and features are in this appendix. We only added old models from the last day of work, so it is not all-encompassing.
def no_cols(data):
    return data[[predicting,
                 "lat"]]
X_train_n = no_cols(train_c).drop([predicting], axis = 1)
y_train = train_c[predicting]
regr0 = lm.LinearRegression()
regr0.fit(X_train_n, y_train)
y_fitted = regr0.predict(X_train_n)
models["none"] = regr0
training_error = rmse(y_fitted, y_train)
training_rmse.append(training_error)
training_error
print(regr0.score(X_train_n, y_train))
c_scores = cross_val_score(regr0, X_train_n, y_train, cv=5, scoring="neg_root_mean_squared_error")
c_scores = c_scores * -1
print(c_scores.mean())
validate_rmse.append(c_scores.mean())
c_scores
We evaluate this appendix model on the test data below.
X_test_n = no_cols(test_c).drop([predicting], axis = 1)
y_test = test_c[predicting]
y_predicted = regr0.predict(X_test_n)
residuals = y_test - y_predicted
ax = sns.regplot(x=y_test, y=residuals)
test_error = rmse(y_predicted, y_test)
test_rmse.append(test_error)
test_error
#standardize the health features with sklearn's preprocessing.scale
stda = preprocessing.scale(X_train_h)
stda